Abstract: We present a novel local improvement scheme for 
the perfectly balanced graph partitioning problem. This scheme 
encodes local searches that are not restricted to a balance 
constraint into a model allowing us to find combinations of 
these searches maintaining balance by applying a negative 
cycle detection algorithm. We combine this technique with 
an algorithm to balance unbalanced solutions and integrate 
it into a parallel multilevel evolutionary algorithm, KaFFPaE, 
to tackle the problem. Overall, we obtain a system that is 
fast on the one hand and on the other hand is able to 
improve or reproduce most of the best known perfectly balanced 
partitioning results ever reported in the literature. 



I. Introduction 

In computer science, engineering, and related fields graph 
partitioning is a common technique. For example, in parallel 
computing good partitionings of unstructured graphs are 
very valuable. In this area, graph partitioning is mostly used 
to partition the underlying graph model of computation and 
communication. Roughly speaking, nodes in this graph rep- 
resent computation units and edges denote communication. 
This graph needs to be partitioned such that there are few 
edges between the blocks (pieces). In particular, if we want 
to use k processors we want to partition the graph into k 
blocks of about equal size. In this paper we focus on the 
perfectly balanced version of the problem that constrains 
the maximum block size to average block size and tries to 
minimize the total cut size, i.e. the number of edges that run 
between blocks. In practice the perfectly balanced version 
of this problem is important for small graph models where 
nodes stand for a large amount of computation. If this graph 
needs to be partitioned into a large number of blocks, e.g. for 
a large number of processors, then already a small amount 
of overloaded vertices can yield bad load imbalance. 

During the last years we started to put all aspects of the 
multi-level graph partitioning (MGP) scheme on trial since 
we had the impression that certain aspects of the method 
are not well understood. Our main focus is partition quality 
rather than partitioning speed. In our sequential MGP framework
KaFFPa (Karlsruhe Fast Flow Partitioner) [18], we presented
novel local search as well as global search algorithms
similar to the strategies used in the multigrid community. In
the Walshaw benchmark [21], KaFFPa was beaten, mostly
on small graphs, by algorithms that combine multilevel
partitioning with an evolutionary algorithm. We therefore
developed an improved evolutionary algorithm, KaFFPaE
(KaFFPa Evolutionary) [19], that also employs coarse grained
parallelism. Both of
these algorithms are able to compute partitions of very high 
quality in a reasonable amount of time when some imbalance 
$\epsilon > 0$ is allowed. However, they are not yet very good for
the perfectly balanced case $\epsilon = 0$. In the perfectly balanced
case state-of-the-art local search algorithms are restricted to 
find nodes to be exchanged between a pair of blocks in order 
to decrease the cut and to maintain perfect balance. Hence, 
we introduce new specialized techniques for the perfectly 
balanced case in this paper. Experiments indicate that the 
techniques are also useful if some imbalance is allowed. 
From a metaheuristic point of view, the proposed algorithms
increase the neighborhood of a perfectly balanced solution in 
which local search is able to find better solutions. Moreover, 
we provide efficient ways to explore this neighborhood. As 
we will see, these algorithms guarantee that the output par- 
tition is perfectly balanced whereas current solvers basically 
do not guarantee perfect balance. 

Although the problem is NP-hard [7] and hard to approximate
on general graphs [7], an astonishingly large set of
"easier" graph algorithms is used to tackle the problem.
Examples include weighted matching, spanning
trees, edge coloring, breadth first search, maximum flows,
diffusion and strongly connected components. In this paper
this list is further augmented by two well known algorithms: 
negative cycle detection and shortest path algorithms allow- 
ing negative edge weights. 

The paper is organized as follows. We begin in Section II
by introducing basic concepts. After briefly presenting
related work in Section III, we describe novel
perfectly balanced local search and balancing algorithms
in Section IV. Here we start by explaining the very basic
idea that allows us to find combinations of simple node
movements. We then explain directed local searches and
extend the basic idea to a complex model containing more
node movements. This is followed by a description of how
these techniques are integrated into KaFFPaE. A summary
of extensive experiments done to evaluate the performance
of our algorithms is presented in Section V.



II. Preliminaries 

Consider an undirected graph $G = (V, E, \omega)$ with edge
weights $\omega : E \to \mathbb{R}_{>0}$, $n = |V|$, and $m = |E|$. We
extend $\omega$ to sets, i.e., $\omega(E') := \sum_{e \in E'} \omega(e)$.
$\Gamma(v) := \{u : \{v, u\} \in E\}$ denotes the neighbors of $v$. We are
looking for blocks of nodes $V_1, \ldots, V_k$ that partition $V$, i.e.,
$V_1 \cup \cdots \cup V_k = V$ and $V_i \cap V_j = \emptyset$ for $i \neq j$. A balancing
constraint demands that
$\forall i \in \{1..k\} : |V_i| \le L_{\max} := (1 + \epsilon) \lceil |V|/k \rceil$.
In the perfectly balanced case the imbalance
parameter $\epsilon$ is set to zero. A graph is perfectly $k$-divisible if
$\lceil |V|/k \rceil = |V|/k$. The objective is to minimize the total cut
$\sum_{i<j} \omega(E_{ij})$ where $E_{ij} := \{\{u, v\} \in E : u \in V_i, v \in V_j\}$.
A node $v \in V_i$ that has a neighbor $w \in V_j$, $i \neq j$, is a
boundary node. An abstract view of the partitioned graph is
the so called quotient graph, where nodes represent blocks
and edges are induced by connectivity between blocks. An
example is shown in Figure 1. Given a partition, the gain of
a node $v$ in block $A$ with respect to a block $B$ is defined as
$g_{A,B}(v) = \omega(\{(v, w) \mid w \in \Gamma(v) \cap B\}) - \omega(\{(v, w) \mid w \in \Gamma(v) \cap A\})$,
i.e. the reduction in the cut when $v$ is moved
from block $A$ to block $B$. By default, our initial inputs will
have unit node weights. However, the algorithms proposed
in this paper can be easily extended to weighted nodes.

Fig. 1. A graph that is partitioned into three blocks of size four on the 
left and its corresponding quotient graph on the right. There is an edge in 
the quotient graph if there is an edge between the corresponding blocks in 
the original graph. 
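
To make the cut and gain definitions of this section concrete, the following minimal Python sketch evaluates them on a tiny example. The adjacency-dictionary representation and the names cut_weight, gain and is_balanced are assumptions chosen for illustration; they are not taken from KaFFPa.

```python
from math import ceil

def cut_weight(adj, block):
    """Total weight of edges whose endpoints lie in different blocks."""
    total = 0
    for u, nbrs in adj.items():
        for v, w in nbrs.items():
            if u < v and block[u] != block[v]:  # count each undirected edge once
                total += w
    return total

def gain(adj, block, v, target):
    """Reduction in the cut when v moves from its current block to `target`."""
    src = block[v]
    to_target = sum(w for u, w in adj[v].items() if block[u] == target)
    to_source = sum(w for u, w in adj[v].items() if block[u] == src)
    return to_target - to_source

def is_balanced(block, k, eps=0.0):
    """Check |V_i| <= (1 + eps) * ceil(|V| / k) for every block."""
    l_max = (1 + eps) * ceil(len(block) / k)
    sizes = {}
    for b in block.values():
        sizes[b] = sizes.get(b, 0) + 1
    return all(s <= l_max for s in sizes.values())

# Path 0 - 1 - 2 - 3 with unit edge weights, blocks {0, 1} and {2, 3}.
adj = {0: {1: 1}, 1: {0: 1, 2: 1}, 2: {1: 1, 3: 1}, 3: {2: 1}}
block = {0: 0, 1: 0, 2: 1, 3: 1}
print(cut_weight(adj, block), gain(adj, block, 1, 1), is_balanced(block, 2))
```

Moving node 1 to the second block trades the cut edge {1, 2} for the cut edge {0, 1}, so its gain is zero, while the move would violate perfect balance.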

A successful heuristic for partitioning large graphs is 
the multilevel approach. Here, the graph is recursively con- 
tracted to achieve a smaller graph with the same basic 
structure. After applying an initial partitioning algorithm 
to the smallest graph in the hierarchy, the contraction is 
undone and, at each level, a local refinement method is used 
to improve the partitioning induced by the coarser level. 

III. Related Work 

There has been a huge amount of research on graph
partitioning so that we refer the reader to [6], [12], [23].
All general purpose methods that are able to obtain good
partitions for large real world graphs are based on the
multilevel principle outlined in Section II. Well known
software packages based on this approach include Jostle [23],
Metis [14], and Scotch [17]. However, for different reasons
they are not able to guarantee that the produced partition is
perfectly balanced.



KaFFPa [18] is a multi-level graph partitioning algorithm
using local improvement algorithms that are based
on flows and more localized FM searches. KaFFPaE [19]
is a distributed parallel evolutionary algorithm that uses
our multilevel graph partitioning framework KaFFPa [18]
to create individuals and modifies the coarsening phase to
provide new effective combine operations. It currently holds
the best results for many graphs in Walshaw's Benchmark
Archive [21] when some imbalance is allowed. KaPPa [13]
is a "classical" matching based multi-level graph partitioning
algorithm designed for scalable parallel execution.

DiBaP lfT31 is a multi-level graph partitioner where local 
improvement is based on diffusion. It currently holds some 
of the best results in the perfectly balanced case for large 
graphs in Walshaw's Benchmark Archive [21]. Benlic et
al. 0, E), provided multilevel memetic algorithms for
perfectly balanced graph partitioning. Their approach is able
to compute many entries in Walshaw's Benchmark Archive
[21] for the case $\epsilon = 0$. However, they are not able to
guarantee that the computed partition is perfectly balanced,
especially for larger values of k. PROBE [8] is a metaheuristic
which can be viewed as a genetic algorithm without
selection. It is restricted to the case k = 2 and $\epsilon = 0$.

IV. Perfectly Balanced Local Search by 
Negative Cycle Detection 

In this section we describe our local search and bal- 
ancing algorithms for perfectly balanced graph partitioning. 
Roughly speaking, all of our algorithms consist of two 
components. The first component consists of local searches on 
pairs of blocks that share a non-empty boundary, i.e. all 
edges in the quotient graph. These local searches are not 
restricted to the balance constraint of the graph partitioning 
problem and are undone after they have been performed. 
The second component uses the information gathered in the 
first component. That means we build a model using the 
node movements performed in the first step enabling us to 
find combinations of those node movements that maintain 
balance. 

We begin by describing the very basic algorithm and 
go on by presenting an advanced model which enables us 
to combine complex local searches. This is followed by a 
description of how local search and balancing algorithms 
are put together. At the end of this section we show how we 
integrate these algorithms into our evolutionary framework 
KaFFPaE. 

A. Basic Idea - Using A Negative Cycle Detection Algorithm 

We are now ready to explain the basic idea for a balanced 
local search algorithm. As we will see, local searches are 
fairly simple in this case. Before we start we introduce two 
notations: a node in the graph G can be in one of two states,
marked or unmarked. By default a node is unmarked. A node is
called eligible if it is not adjacent to a previously marked
node.

Fig. 2. On top left an example graph that is partitioned into three
parts (A, B and C) of four nodes each. Possible candidates for movement
are highlighted. On top right the corresponding model is shown and one
negative cycle is highlighted. On the bottom the updated partition after the
node movements associated with the cycle are performed is shown. Moved
nodes are highlighted. The reduction in the number of edges cut is equal
to the negated weight of the cycle.

We can now build the model of the underlying partition
of the graph G, $Q = (\{1, \ldots, k\}, \mathcal{E})$ where $(A, B) \in \mathcal{E}$ if
there is an edge in G that runs between the blocks A and B.
We define edge weights $\omega_Q : \mathcal{E} \to \mathbb{R}$ in the following way:
for each directed edge $e = (A, B) \in \mathcal{E}$, in a random order,
find an eligible boundary node v in block A having maximum
gain $g_{\max}(A, B)$, i.e. a node v that maximizes the reduction
in cut size when moving it from block A to block B. If there
is more than one such node we break ties randomly. The
node is then marked. This is basically all the local search
that is done in the basic algorithm. The weight of e is then
defined as the negative gain value of its associated node v,
i.e. $\omega_Q(e) := -g_{\max}(A, B)$. It is important to notice that a
forward edge (A, B) does not have to have the same weight
as its backward edge (B, A). An example partitioned graph
with a basic model is shown in Figure 2. Observe that the
basic model is a directed and weighted version
of the quotient graph and that the selected nodes form an
independent set. Note that each cycle in this model defines a
set of node movements and furthermore when the associated
nodes of a cycle are moved, then each block contains the
same number of nodes as before. Also, because the selected
nodes form an independent set, the negated weight of a cycle
in the model is equal to the reduction in the cut when the
associated node movements are performed. However, the
most important aspect is that a negative cycle in the model
corresponds to a set of node movements that will decrease
the overall cut and maintain the balance of the partition. To
detect a negative cycle in this model we introduce a node s
and connect it to all nodes in Q. The weight of the inserted
edges is set to zero. We can apply a standard shortest path
algorithm [9] that can handle negative edge weights to detect
a negative cycle. If the model contains a negative cycle we
can perform a set of node movements that will not alter
block sizes since each block obtains and emits a node.
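
The detection step can be sketched in a few lines of Python. The paper refers to [9] for the algorithm it uses; the sketch below substitutes plain Bellman-Ford relaxation as a stand-in. The function names and the dictionary encoding of the model are assumptions made for illustration. Initializing every distance to zero plays the role of the zero weight edges from s.

```python
def find_negative_cycle(num_blocks, edges):
    """edges maps a directed pair (A, B) to its model weight.
    Returns the blocks of one negative cycle in edge order, or None."""
    dist = [0] * num_blocks      # zero weight edge from s to every block
    pred = [None] * num_blocks
    last = None
    for _ in range(num_blocks):
        last = None
        for (a, b), w in edges.items():
            if dist[a] + w < dist[b]:
                dist[b] = dist[a] + w
                pred[b] = a
                last = b
        if last is None:         # no relaxation in this pass: no negative cycle
            return None
    # A relaxation in the final pass implies a negative cycle; walking back
    # num_blocks predecessor steps guarantees that we stand on it.
    v = last
    for _ in range(num_blocks):
        v = pred[v]
    cycle = [v]
    u = pred[v]
    while u != v:
        cycle.append(u)
        u = pred[u]
    cycle.reverse()              # predecessor order -> edge order
    return cycle

def cycle_weight(cycle, edges):
    """Sum of the edge weights along a cycle given as a list of blocks."""
    n = len(cycle)
    return sum(edges[(cycle[i], cycle[(i + 1) % n])] for i in range(n))

# Three blocks; weights are negated gains of the selected nodes.
model = {(0, 1): -2, (1, 2): 1, (2, 0): 0, (1, 0): 3}
cycle = find_negative_cycle(3, model)
print(cycle, cycle_weight(cycle, model))
```

In this example the only negative cycle visits all three blocks and has weight -1, so performing the three associated moves would reduce the cut by one while every block keeps its size.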



When starting with an unbalanced partitioning, i.e. a 
partition of the graph that is not perfectly balanced, or if the 
graph is not perfectly k-divisible, the block weight invariant
still holds. However, we can add a slight extension to the
model: each node in Q that corresponds to a block which can
take at least one node without becoming overloaded is
connected to s by an edge of weight zero. Note that a
negative cycle containing s may alter some block weights
and can lead to a balancing operation reducing the cut. In 
fact such a cycle can correspond to a set of node movements 
starting in an arbitrary block and ending in a block that 
can take a node without becoming overloaded (basically a 
path in the quotient graph). However, if there is no negative 
cycle in the model we have to think about diversification and 
balancing strategies which is done in the following sections. 
An interesting observation is that the algorithm can be seen 
as an extension of the classical FM algorithm that swaps 
nodes between two adjacent blocks (two at a time) which is 
basically a negative cycle of length two in our model if the 
gain of the two node movements is positive. 

Diversification by Zero Gain Cycle Moves: A zero weight 
cycle in the basic model is associated with a set of node 
movements that keep the cut unchanged and the block 
weights constant. After such a movement is performed it 
might be possible to find further negative cycles in the basic 
model since candidates for movements and gain values may 
have changed. Hence these cycles can be useful to introduce 
some diversification. 

Nonetheless, on general graphs it is NP-complete to
decide whether a weighted graph contains a cycle that has
weight zero, i.e. the sum of the edge weights of this cycle
is zero. However, we will see that if a graph does not
contain a negative cycle we can decide whether it contains
a cycle of weight zero in polynomial time and output one
if one exists. This can be done by using the following
technique. As soon as the model described above does
not contain negative cycles we compute a shortest path
tree starting at s. By doing this we obtain node potentials
$\Pi : \{1, \ldots, k\} \to \mathbb{R}$, i.e. the shortest path distances from
s to all other nodes. We then define modified edge weights
$\xi_Q(e = (A, B)) = \omega_Q(e) + \Pi(A) - \Pi(B)$. It is quite easy
to see that the weight of a cycle in Q does not change
when we use $\xi_Q$ instead of $\omega_Q$, since the potentials
cancel along a cycle. In particular cycles that
have weight zero w.r.t. $\omega_Q$ will have weight zero w.r.t. $\xi_Q$.
Another important observation is that $\xi_Q$ is a non-negative
function. Hence, in order to detect a zero weight cycle we
can remove all edges e with $\xi_Q(e) > 0$ since they cannot be
part of a cycle having weight zero. After this is done we
compute all strongly connected components of this graph.
If there is a strongly connected component that contains
more than one node then the graph contains a cycle that
has weight zero. To output one zero weight cycle we pick
a random node N out of the components having more than
one node. Starting at this node we perform a random walk
in its component until we find a node M that we have already
seen (which is not necessarily N). It is then fairly simple
to output the respective cycle starting at M. Note that if the
component contains j nodes then the random walk will stop
after at most j iterations. As soon as we have found a cycle
of weight zero we can perform the node movements that are
associated with the edges of the cycle.
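
The potential-based reduction can be illustrated with the following Python sketch. It assumes integer edge weights (so that the test for a reduced weight of zero is exact) and a model without negative cycles. For brevity it does not compute strongly connected components; instead it repeatedly discards blocks without an outgoing zero edge and then walks deterministically until a block repeats. That substitution, the function name and the data layout are assumptions for illustration.

```python
def find_zero_cycle(num_blocks, edges):
    """edges maps (A, B) to an integer weight; no negative cycle may exist.
    Returns the blocks of one zero weight cycle in edge order, or None."""
    # Potentials: shortest path distances from s (zero edge to every block).
    pot = [0] * num_blocks
    for _ in range(num_blocks):
        for (a, b), w in edges.items():
            if pot[a] + w < pot[b]:
                pot[b] = pot[a] + w
    # Keep only edges whose reduced weight w + pot[a] - pot[b] is zero.
    succ = {v: [] for v in range(num_blocks)}
    for (a, b), w in edges.items():
        if w + pot[a] - pot[b] == 0:
            succ[a].append(b)
    # Discard blocks without outgoing zero edges until nothing changes.
    alive = set(range(num_blocks))
    changed = True
    while changed:
        changed = False
        for v in list(alive):
            succ[v] = [u for u in succ[v] if u in alive]
            if not succ[v]:
                alive.remove(v)
                changed = True
    if not alive:
        return None
    # Every remaining block has a remaining successor, so the walk must
    # revisit a block; the part of the walk from there on is a cycle.
    walk, position = [], {}
    v = next(iter(alive))
    while v not in position:
        position[v] = len(walk)
        walk.append(v)
        v = succ[v][0]
    return walk[position[v]:]

model = {(0, 1): 2, (1, 0): -2, (1, 2): 1, (2, 1): 0}
print(find_zero_cycle(3, model))
```

Here the cycle between blocks 0 and 1 has weight 2 + (-2) = 0 and is reported, while the cycle between blocks 1 and 2 has weight 1 and is filtered out by its positive reduced weight.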

B. Advanced Model 

Our advanced model is strongly coupled with advanced 
local search algorithms. Roughly speaking, each edge in 
the advanced model is associated with a whole set of 
node movements which have been found by local searches. 
Hence, a negative cycle in this model will correspond to a 
combination of local searches with positive overall gain that 
maintain balance or that can improve balance. 

Before we build the advanced model we perform a single 
directed local search on each pair of blocks that share a non- 
empty boundary, i.e. each pair of blocks that are adjacent 
in the quotient graph. A local search on a directed pair of 
blocks (A, B) is only allowed to move nodes from block 
A to block B. The order in which the directed local search 
between a directed pair of blocks is performed is random. 
That means we pick a random directed adjacent pair of 
blocks on which local search has not been performed yet 
and perform local search as described below. This is done 
until local search was done between all directed adjacent 
pairs of blocks once. 

Fig. 3. On top a graph that is partitioned into three parts (|A| = 14, |B| =
12, |C| = 14). Directed local searches on each directed pair of blocks are
highlighted ($\tau = 3$). The corresponding advanced model is shown on the
bottom. Each layer is a copy of the quotient graph of the partition. Edges
within layer d represent node movements consisting of d nodes that have
been found previously using directed local search. s is connected to all
nodes (most of the edges are not shown), edges back to s are inserted if
the corresponding block can take some nodes without becoming overloaded
(in this example block B), backward edges between layers are inserted
if the block can take nodes without becoming overloaded, and forward edges
between the layers are inserted in any case. Within layer 3 a negative cycle is
highlighted (red/dark dashed) which corresponds to the movement of the nine
red/dark nodes on top. Another negative cycle is highlighted in yellow/light
gray dashed. It corresponds to the movement of the five yellow/light gray
nodes on top. The weight of both cycles is -2. After these movements are
performed the resulting partition is perfectly balanced.

Directed Local Search: We now explain how we perform
a directed local search between a pair (A, B) of blocks.
A directed local search between two blocks A and B is
very localized, akin to the multi-try method used in KaFFPa
[18]. However, a directed local search between A and B
is restricted to move nodes from block A to block B. It
is similar to the FM-algorithm [11]: We start with a single
random eligible boundary node of block A having maximum
gain $g_{\max}(A, B)$ and put this node into a priority queue. The
priority queue contains nodes of the block A that are valid
to move. The priority is based on the gain, i.e. the decrease
in edge cut when the node is moved from block A to block
B. We always move the node that has the highest priority
to block B. After a node is moved its eligible neighbors
that are in block A are inserted into the priority queue. We
perform at most $\tau$ steps per directed local search, where $\tau$ is
a parameter. Note that during a directed local search we only
move nodes that are not adjacent to a node moved during a
previous directed local search. This restriction is necessary
to keep the model described below accurate. Thus we mark
all nodes touched during a directed local search after it was
performed, which also implies that each node is moved at
most once. In addition all moved nodes are moved back to
their origin, since these movements would make the partition
imbalanced. We stress that all nodes adjacent to nodes that
have been moved during a directed local search are not
eligible for any later local search during the construction,
since this would make the computed gain values imprecise.
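
A directed local search of this kind might look as follows in Python. This is a simplified sketch under stated assumptions: the start node is passed in rather than chosen at random, a plain set with a linear scan replaces the priority queue so that gains are always current, and the set `ineligible` stands for the nodes blocked by earlier searches. None of these names come from KaFFPa.

```python
def directed_local_search(adj, block, a, b, start, tau, ineligible=frozenset()):
    """Move up to tau nodes from block a to block b, beginning with `start`.
    Returns (moved nodes, cumulative gains); the partition is restored."""
    def gain(v):
        to_b = sum(w for u, w in adj[v].items() if block[u] == b)
        to_a = sum(w for u, w in adj[v].items() if block[u] == a)
        return to_b - to_a

    candidates = {start}
    moved, gains, total = [], [], 0
    while candidates and len(moved) < tau:
        # Highest gain first; ties go to the smallest node id.
        v = max(sorted(candidates), key=gain)
        candidates.remove(v)
        total += gain(v)
        block[v] = b
        moved.append(v)
        gains.append(total)          # d-th entry: gain of moving the first d nodes
        for u in adj[v]:
            if block[u] == a and u not in ineligible:
                candidates.add(u)
    for v in moved:                  # undo: the moves would break balance
        block[v] = a
    return moved, gains

# Path 0 - 1 - 2 - 3 - 4 - 5; edge {2, 3} has weight 3, all others weight 1.
adj = {0: {1: 1}, 1: {0: 1, 2: 1}, 2: {1: 1, 3: 3},
       3: {2: 3, 4: 1}, 4: {3: 1, 5: 1}, 5: {4: 1}}
block = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
print(directed_local_search(adj, block, 0, 1, start=2, tau=2))
```

The returned pair corresponds to the sequences of node movements and cumulative gain values that the model construction consumes.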

The Model: The advanced model allows us to find com- 
binations of directed local searches such that the balance of 
the given partition is at least maintained. Specifically, a negative
cycle in the model represents a set of node movements that
maintain or improve balance while the number of edges
cut is reduced. Roughly speaking, most of the edges in the
advanced model will be associated with a subset of a directed
local search that was computed above.

The local search process described above yields for each
pair of blocks e = (A, B) in the quotient graph a sequence
of node movements $S_e \in V^{\tilde{\tau}}$ and a sequence of gain
values $g_e \in \mathbb{R}^{\tilde{\tau}}$. Here $\tilde{\tau}$ is smaller than or equal to $\tau$, the
maximum number of node movements allowed for a single
directed local search. The d-th value in $g_e$ corresponds to
the reduction in the cut between the pair of blocks (A, B)
when the first d nodes in $S_e$ are moved from their source
block A to their target block B. By construction, a node
$v \in V$ can occur in at most one of the sequences created
and in its sequence only once.

Roughly speaking, the advanced model consists of $\tau$
layers. Essentially each layer is a copy of the quotient
graph. An edge starting and ending in layer d of this model
corresponds to the movement of exactly d nodes. The weight
of an edge e = (A, B) in layer d of the model is set
to the negative value of the d-th entry in $g_e$ if $|g_e| \ge d$;
otherwise the edge is removed from that layer. In other words
it encodes the negative value of the gain, when the first d
nodes in $S_e$ are moved from block A to block B. Hence,
a negative cycle whose nodes are all in layer d will move
exactly d nodes between each of the respective block pairs
contained in the cycle and results in an overall decrease in
the edge cut. We add additional edges to the model such
that it contains more moves in the case of an imbalanced
input partition or if the graph is not perfectly k-divisible.
To be more precise, in these cases we want to get rid of the
restriction that each block receives and emits the same number
of nodes. To do so we insert forward edges between all
consecutive layers, i.e. each block in layer d is connected by
an edge of weight zero to its copy in layer d+1. These edges
are not associated with node movements. Furthermore, we
add backward edges as follows: for an edge (A, B) in layer
d we add an edge with the same weight between block A
in layer d and block B in layer $d - \ell$ if block B can take
$\ell$ nodes without becoming overloaded. The newly inserted
edge is associated with the same node movements as the
initial edge (A, B) within layer d. This way we encode
movements in the model where a block can emit more nodes
than it receives and vice versa without violating the balance
constraint. Additionally we connect each node in layer d
back to s if the associated block can take at least d nodes
without becoming overloaded. As in the basic model this
ensures that the model might contain cycles through s. That
means that we also can find cycles corresponding to paths
in the quotient graph being associated with node movements
that decrease the overall cut. Moreover, these moves never
increase the imbalance of the input partition. An example
advanced model is shown in Figure 3. Note that if the input
partition is perfectly balanced and the graph is perfectly
k-divisible then the backward edges, including those back to s,
are not contained in the model. Also note that the zero gain
diversification can also be applied in the advanced model.
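
The construction of the layered model can be written down compactly. The Python sketch below is an illustration under assumptions: `gains` maps each directed pair of blocks to its cumulative gain sequence, `capacity` gives the number of nodes each block can still take without becoming overloaded, and model nodes are encoded as (layer, block) tuples plus the string "s". These names and encodings are hypothetical.

```python
def build_advanced_model(gains, capacity, tau):
    """Return model edges as (tail, head, weight, move) tuples, where move is
    ((A, B), d) for edges that carry node movements and None otherwise."""
    blocks = sorted(capacity)
    edges = []
    for d in range(1, tau + 1):
        for a in blocks:
            edges.append(("s", (d, a), 0, None))             # s reaches every node
            if d < tau:
                edges.append(((d, a), (d + 1, a), 0, None))  # forward edge
            if capacity[a] >= d:
                edges.append(((d, a), "s", 0, None))         # edge back to s
        for (a, b), g in gains.items():
            if len(g) < d:
                continue                                     # search moved fewer than d nodes
            w = -g[d - 1]
            edges.append(((d, a), (d, b), w, ((a, b), d)))   # edge within layer d
            for l in range(1, d):
                if capacity[b] >= l:                         # backward edge to layer d - l
                    edges.append(((d, a), (d - l, b), w, ((a, b), d)))
    return edges

# Block sizes 14, 12, 14 with a maximum block size of 14: only B has room.
capacity = {"A": 0, "B": 2, "C": 0}
gains = {("A", "B"): [1, 1, 2]}
model = build_advanced_model(gains, capacity, tau=3)
print(len(model))
```

If every capacity is zero, which is the perfectly balanced and perfectly k-divisible case, the function emits neither backward edges nor edges back to s, in line with the remark at the end of the paragraph above.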

Packing: The algorithm can be further improved by
performing/packing multiple directed local searches between
each pair of blocks that share a non-empty boundary. To be
more precise, after we have computed node movements on
each pair of blocks e = (A, B) we start again using the
nodes that are still eligible. This is done $\mu$ times. We say
$\mu$ is the number of packing iterations. The model is then
slightly modified in the following way: For the creation of
edges in the model that correspond to the movement of d
nodes from block A to block B we use the directed local
search on e = (A, B) from the process above with the best
overall gain when moving d nodes from block A to block B
(and use this gain value for the computation of the weight
of corresponding edges).

Fig. 4. The first type of conflict that can occur in the advanced model. In
this example two layers of the advanced model corresponding to the graph in
Figure 3 are shown. Only the edges of a conflicted cycle are drawn. The
problem with the cycle is the highlighted edges running in layers two and three
from the nodes representing block B to the nodes representing block C.
They are associated with node movements where subsets are equal. The
drawn cycle does not correspond to a simple cycle in the quotient graph.

Conflicts: The advanced model is a bit problematic since
it contains two types of conflicts due to the edges that run
between the layers. First the model can contain cycles that
do not correspond to a simple cycle in the quotient graph.
Such a cycle is problematic because it contains the same
edge e = (A, B) in the quotient graph multiple times.
An example is given in Figure 4. Let us assume that one
associated edge runs in layer d and one in layer $\ell$ with $\ell < d$.
The associated node movements cannot be performed fully
since the edges correspond to subsets of the same directed
local search. This is due to the fact that the edge in layer $\ell$
corresponds to the movement of the first $\ell$ nodes in $S_e$. These
movements are a subset of the node movements associated
with the edge in layer d, which corresponds to movement
of the first d nodes in $S_e$. In other words, when we want to
move the nodes associated with the edge in layer d then they
are already in block B if the node movements of the edge in
layer $\ell$ have been performed before and vice versa depending
on the order of execution. That means that for at least one
of those two edges its weight does not correspond to the
reduction in the cut of the underlying node movements.
Hence, the weight of the cycle does not reflect the reduction
in the number of edges cut.

Fig. 5. The second type of conflict that can occur in the advanced model.
In this example two layers of the advanced model from Figure 3 are drawn.
Only the edges of a conflicted cycle are shown. The edge in layer one from
the node representing block B back to s was created because block B can
take one node without becoming overloaded (|B| = 12, |V|/k = 13).
For the same reason there is the edge between the layers from the node
representing block A in layer two to the node representing block B in layer
one. In the model there is no edge from block B in layer two back to s since
block B can only take one node without becoming overloaded. However,
when performing the associated node movements block B receives two
vertices from block A and is overloaded afterwards.

Secondly, since we have both edges between the layers
and edges back to s, a cycle in the model can lead to
node movements that overload a block. An example is given
in Figure 5. A conflict can only occur if we have edges
running between the layers, which is only the case when
we start with an unbalanced input partition or if the graph
is not perfectly k-divisible. Our experiments indicate that
conflicts do not occur very often. Furthermore, a conflicted
cycle is easily detected. We can simply check whether the cycle
in the model corresponds to a simple cycle in the quotient graph
and whether a block would become overloaded when performing
the node movements of that cycle. If our algorithm returns a cycle that
contains a conflict we remove a random edge of the cycle
in the model and start the negative cycle detection strategy
again. Note that if we remove all edges in the model that
run between the layers then the model is conflict-free but
encodes fewer possible combinations of node movements.
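
Both conflict checks are cheap. The following Python sketch assumes that a cycle is given as a list of model edges of the form (tail, head, weight, move), where move is ((A, B), d) for edges that carry d node movements from A to B and None for forward edges and edges incident to s. It tests the first conflict type by looking for a repeated directed pair of blocks, which is the situation described above; the encoding and the function name are assumptions for illustration.

```python
def has_conflict(cycle_edges, capacity):
    """Return True if the cycle reuses a directed pair of blocks or if
    performing its node movements would overload some block."""
    seen_pairs = set()
    net = {}  # block -> nodes received minus nodes emitted
    for _tail, _head, _weight, move in cycle_edges:
        if move is None:
            continue
        (a, b), d = move
        if (a, b) in seen_pairs:
            return True              # first type: same local search used twice
        seen_pairs.add((a, b))
        net[a] = net.get(a, 0) - d
        net[b] = net.get(b, 0) + d
    # Second type: a block receives more nodes than it can take.
    return any(delta > capacity.get(blk, 0) for blk, delta in net.items())

# The situation of Fig. 5: B can take one node but would receive two.
capacity = {"A": 0, "B": 1, "C": 0}
fig5_cycle = [("s", (2, "A"), 0, None),
              ((2, "A"), (1, "B"), -1, (("A", "B"), 2)),
              ((1, "B"), "s", 0, None)]
print(has_conflict(fig5_cycle, capacity))
```

The edge weight of -1 in the example is made up; only the associated moves matter for the check.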

Balancing: As we will see in Section IV-C, to create
perfectly balanced partitions we start our algorithm with an
$\epsilon$-balanced partition, i.e. a partition where larger imbalance
is allowed. Hence, to achieve perfect balance we have to
think about balancing strategies. A balancing step will only
be applied if the model does not contain a negative cycle
(see next section for more details). Hence, we can modify
the advanced model such that we can find a set of node
movements that will decrease the total number of overloaded
nodes by at least one and minimizes the increase in the
number of edges cut. Specifically, we introduce a second
node t. Now instead of connecting s to all vertices, we
connect it only to nodes representing overloaded blocks,
i.e. $|V_i| > \lceil |V|/k \rceil$. Additionally, we connect a node in
layer $\ell$ to t if the associated block can take at least $\ell$ nodes
without becoming overloaded. Since the underlying model
does not contain negative cycles we can apply a shortest
path algorithm to find a shortest path from s to t. We use
a variant of the algorithm of Bellman and Ford since edge
weights might still be negative (for more details see Section
V). It is now easy to see that a shortest path in this model
yields a set of node movements with the smallest loss in
number of cut edges and that the total number of overloaded
nodes decreases by at least one. If $\tau$ is set to one we call this
algorithm basic balancing, otherwise advanced balancing.
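
The balancing step reduces to a single shortest path computation. The sketch below shows a textbook Bellman-Ford search on an edge list; the paper uses a variant whose details are given in its Section V, so this is a stand-in, and the node labels and example weights are made up for illustration.

```python
def shortest_path(edges, source, target):
    """Bellman-Ford on a list of (tail, head, weight) edges.
    Assumes there is no negative cycle. Returns (distance, path) or (None, None)."""
    nodes = {source, target}
    for u, v, _ in edges:
        nodes.update((u, v))
    inf = float("inf")
    dist = {v: inf for v in nodes}
    pred = {}
    dist[source] = 0
    for _ in range(len(nodes) - 1):
        changed = False
        for u, v, w in edges:
            if dist[u] + w < dist[v]:
                dist[v] = dist[u] + w
                pred[v] = u
                changed = True
        if not changed:
            break
    if dist[target] == inf:
        return None, None
    path = [target]
    while path[-1] != source:
        path.append(pred[path[-1]])
    path.reverse()
    return dist[target], path

# Basic balancing (one layer): A is overloaded, C can take a node.
edges = [("s", "A", 0), ("A", "B", 1), ("B", "C", -2), ("A", "C", 3), ("C", "t", 0)]
print(shortest_path(edges, "s", "t"))
```

In the example the detour through B has total weight -1, so moving one node from A to B and one from B to C relieves A and even lowers the cut by one, whereas the direct move from A to C would raise the cut by three.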

However, we have to make sure that there is at least one
s-t path in the model. Let us assume for now that the graph
is connected. If the graph is connected then the directed
version of the quotient graph is strongly connected. Hence an
s-t path exists in the model if we are able to perform local
search between all pairs of blocks that share a non-empty
boundary. Because a directed local search can only start from
an eligible node we might not be able to perform directed
local search between all adjacent pairs of blocks, e.g. if there
is no eligible node between a pair of blocks left. We try to
ensure that there is at least one s-t path in the model by
doing the following. Roughly speaking we try to integrate an
s-t path in the model by changing the order in which directed
local searches are performed. First we perform a breadth first
search (BFS) in the quotient graph which is initialized with
all nodes that correspond to overloaded blocks in a random
order. We then pick a random node in the quotient graph
that corresponds to a block A that can take nodes without
becoming overloaded. Using the BFS-forest we find a path
$\mathcal{P} = B \to \cdots \to A$ from an overloaded block B to A. We
now first perform directed local search on all consecutive
pairs of blocks in $\mathcal{P}$. Here we use $\tau = 1$ for the number
of node movements to minimize the number of non-eligible
nodes. If this was successful, i.e. we have been able to move
one node between all directed pairs of blocks in that path,
we perform directed local searches as before on all pairs
of blocks that share a non-empty boundary. Otherwise we
undo the searches done (every node is eligible again) and
start with the next random block that can take a node without
becoming overloaded.
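
The path selection in the quotient graph can be sketched as a multi-source breadth first search. In the Python sketch below the quotient graph is an adjacency dictionary and the sources are visited in the given order rather than a random one; both choices, as well as the function name, are assumptions for illustration.

```python
from collections import deque

def bfs_path_to(quotient, overloaded, target):
    """Multi-source BFS from all overloaded blocks. Returns the BFS-forest path
    from an overloaded block to `target`, or None if target is unreachable."""
    parent = {b: None for b in overloaded}   # roots of the BFS-forest
    queue = deque(overloaded)
    while queue:
        u = queue.popleft()
        for v in quotient[u]:
            if v not in parent:
                parent[v] = u
                queue.append(v)
    if target not in parent:
        return None
    path = [target]
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    path.reverse()
    return path

# B is overloaded, A can take a node; A and B are only connected through C.
quotient = {"A": ["C"], "B": ["C"], "C": ["A", "B", "D"], "D": ["C"]}
print(bfs_path_to(quotient, ["B"], "A"))
```

Directed local searches with a single move each would then be attempted on the consecutive pairs (B, C) and (C, A) before all remaining pairs are processed.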

In some rare cases the algorithm fails to find such a path, 
i.e. each time we look at a path we have one directed pair of 
blocks where no eligible node is left. An example is shown 
in Figure 6. In this case we apply a fallback balance routine 
that is guaranteed to reduce the total number of overloaded 
nodes by one if the input graph is connected. Given the 
BFS-forest of the quotient graph from above we look at all 
paths in it from an overloaded block to a block that can take 
a node without becoming overloaded. At this point there are 
at most on the order of k such paths in our BFS-forest. 
Specifically, for a path P = Z → Y → X → A we select a node 
having maximum gain g_{Z,Y} in Z and move it to Y. We then 
look at Y and do the same with respect to X and so on until 
we move a node to block A. Note that this time we are 
guaranteed to find nodes because after a node has been moved 
it is not blocked for later movements. After the operations 
have been performed they are undone and we continue with 
the next path. In the end we use the movements of the path 
that resulted in the smallest number of edges cut. 
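
A minimal sketch of this fallback routine is given below. It assumes an unweighted graph stored as a symmetric adjacency dictionary with integer node ids and a partition stored as a node-to-block dictionary; all function names are hypothetical. Instead of undoing the moves after each path, the sketch evaluates every path on a copy of the partition, which has the same effect. It also assumes that every block on a path is non-empty and that at least one path is given.

```python
def cut_size(adj, part):
    """Number of edges whose endpoints lie in different blocks."""
    return sum(1 for u in adj for v in adj[u] if u < v and part[u] != part[v])


def gain(adj, part, node, target):
    """Reduction of the cut if node is moved to block target."""
    to_target = sum(1 for v in adj[node] if part[v] == target)
    to_own = sum(1 for v in adj[node] if part[v] == part[node])
    return to_target - to_own


def moves_along_path(adj, part, path):
    """Move one maximum-gain node across each consecutive block pair of
    path, working on a copy. Returns the moves and the resulting cut."""
    trial = dict(part)
    moves = []
    for src, dst in zip(path, path[1:]):
        candidates = [u for u in trial if trial[u] == src]
        node = max(candidates, key=lambda u: gain(adj, trial, u, dst))
        trial[node] = dst  # a moved node stays available for later moves
        moves.append((node, src, dst))
    return moves, cut_size(adj, trial)


def fallback_balance(adj, part, paths):
    """Try every path and apply the moves of the one with the smallest cut."""
    trials = [moves_along_path(adj, part, path) for path in paths]
    best_moves, best_cut = min(trials, key=lambda t: t[1])
    for node, _src, dst in best_moves:
        part[node] = dst
    return best_cut


# Path graph 0-1-2-3-4-5 with block sizes 3, 2, 1 (Z is overloaded).
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
part = {0: "Z", 1: "Z", 2: "Z", 3: "Y", 4: "Y", 5: "A"}
fallback_balance(adj, part, [["Z", "Y", "A"]])
```

In this example node 2 moves from Z to Y and node 4 moves from Y to A, so every block ends up with two nodes while the cut stays at two edges.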

If the graph contains more than one connected component 




Fig. 6. On top an example graph that is partitioned into three parts and 
on bottom a BFS-tree in the quotient graph starting in overloaded block B. 
It is not possible to integrate this path into the model since after directed 
local search is done on the pair (B, C), v will be marked and hence there 
is no eligible node left for the local search on the pair (C, A). A similar 
argument holds if local search is done on the pair (C, A) first. 



then the algorithms described above may not work, for 
example if there is a non-perfectly balanced block in the 
input partition that is the union of some of the graph's 
connected components. More precisely, when we want to 
integrate a path into the model we detect at some point that 
there is no path in the quotient graph that contains this block 
and that can yield a balance improvement, e.g. if the block 
corresponds to a singleton in the quotient graph. To reduce 
the total number of overloaded nodes by one we do the 
following: If the block is overloaded we move a random 
node from this block to an underloaded block; otherwise we 
move a random node from an overloaded block to this block. 

Note that the advanced balancing model can contain 
conflicts, too. This is again because of the edges that run 
between the layers. We handle potential conflicts in paths 
analogously to the conflicts in the advanced model case. 

Putting Things Together: In practice we start our algorithms 
with an unbalanced input partition (see Section IV-C for 
more details). We define two algorithms basic and 
advanced depending on the models used. Both the basic and 
the advanced algorithm operate in rounds. In each round 
we iterate the negative cycle based local search algorithm 
until there are no negative cycles in the corresponding model 
(basic or advanced). After each negative cycle local search 
step we try to find zero weight cycles in the model to 
introduce some diversification. In Section IV-A we also use a 
variant of the basic algorithm that does not use zero weight 
cycle diversification. Since we have random tie breaking at 
multiple places we iterate this part of the algorithm. If we 
do not succeed in finding an improved cut using these two 
operations for λ iterations, we perform a single balancing 
step if the partition is still unbalanced; otherwise we stop. 
The parameter λ basically controls how fast the unbalanced 
input partition is transformed into a partition that satisfies 
the balance constraint. After the balancing operation, the 
total number of overloaded nodes is reduced by at least one 
depending on the balancing model. In the basic algorithm 
we use the basic balancing model (τ = 1) and in the 
advanced algorithm we use the advanced balancing model. 
Since the balance operation can introduce new negative 
cycles in the model we start the next round. We call the 
refinement techniques introduced in this paper Karlsruhe 
Balanced Refinement (KaBaR). 
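
The round structure can be summarized by the following control-flow sketch. It is one possible reading of the description above, not our implementation: the model construction, the cycle searches and the balancing step are passed in as hypothetical callables, and `lam` stands for the number of unsuccessful iterations tolerated before a balancing step.

```python
def refine(state, negative_cycle_step, zero_cycle_step,
           balance_step, is_balanced, cut, lam):
    """Alternate negative cycle local search and balancing in rounds.

    negative_cycle_step(state) applies one negative cycle and returns True,
    or returns False if the model contains no negative cycle.
    zero_cycle_step(state) applies a zero weight cycle for diversification.
    balance_step(state) reduces the number of overloaded nodes.
    """
    while True:
        unsuccessful = 0
        while unsuccessful < lam:
            before = cut(state)
            # Exhaust negative cycles, diversifying after every step.
            while negative_cycle_step(state):
                zero_cycle_step(state)
            if cut(state) < before:
                unsuccessful = 0
            else:
                unsuccessful += 1
        if is_balanced(state):
            return state
        # Balancing may create new negative cycles: start the next round.
        balance_step(state)
```

A toy run with stand-in callables illustrates the flow: each balancing step removes one unit of overload, worsens the cut by two and makes one new improving cycle available.

```python
state = {"cut": 10, "overload": 2, "pending": 3}

def toy_negative_cycle(s):
    if s["pending"] == 0:
        return False
    s["pending"] -= 1
    s["cut"] -= 1
    return True

def toy_balance(s):
    s["overload"] -= 1
    s["cut"] += 2
    s["pending"] += 1

refine(state, toy_negative_cycle, lambda s: None, toy_balance,
       lambda s: s["overload"] == 0, lambda s: s["cut"], lam=2)
```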

C. Integration into KaFFPaE 

We now describe how we integrate our new algorithms 
into our distributed evolutionary algorithm KaFFPaE [19]. 
An evolutionary algorithm starts with a population of in- 
dividuals (in our case partitions of the graph) and evolves 
the population into different populations over several rounds. 
In each round, the evolutionary algorithm uses a selection 
rule based on the fitness of the individuals (in our case 
the edge cut) of the population to select good individuals 
and combine them to obtain improved offspring. Roughly 
speaking, KaFFPaE uses KaFFPa to create individuals and 
modifies the coarsening phase to provide new effective 
combine operations. 

It is well known that allowing larger imbalance is useful 
to create good partitions [22], [20]. Hence, we adopt this 
idea. To obtain perfectly balanced partitions we modify 
the create and combine operations as follows: each time 
we perform such an operation, we randomly choose an 
imbalance parameter ε' ∈ [0.005, ε̂] where ε̂ is an upper 
bound for the allowed imbalance (a tuning parameter). This 
imbalance is then used to perform the operation, i.e. after 
the operation is performed, the offspring/partition has blocks 
with size at most (1 + ε')⌈|V|/k⌉. Giving a larger imbalance 
to the operation yields smaller cuts and makes local search 
more effective since the combine and create operations use 
the multilevel graph partitioner KaFFPa. After the respective 
operation is performed we apply our advanced balancing 
and advanced negative cycle local search (including zero 
weight cycle diversification and the packing approach) to 
obtain a partition of the graph that is perfectly balanced. This 
individual is the final offspring created by the performed 
create or combine operation and inserted into the population 
using the techniques of KaFFPaE [19]. Note that at all times 
each individual in the population of the evolutionary algo- 
rithm is perfectly balanced. Also note that allowing larger 
imbalance enables us to use previously developed techniques 
that otherwise would not be applicable, e.g. max-flow min-cut 
based local search methods. We call the overall algorithm 
Karlsruhe Balanced Partitioner Evolutionary (KaBaPE). As 
experiments will show in Section V, the new kind of local 
search is also helpful if some imbalance is allowed. When 
we use KaBaPE to create ε-balanced partitions we choose 
ε' ∈ [ε + 0.005, ε + ε̂], with the tuning parameter ε̂ from 
above, for the combine and create operations. 
The created individual is then transformed into a partition 
where each block has size at most (1 + ε)⌈|V|/k⌉ using our 
balancing and negative cycle local search strategies. 
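
The choice of the intermediate imbalance and the resulting block size bound can be illustrated as follows. This is a sketch under assumptions: the function names are hypothetical, `eps_hat` stands for the tuning parameter that bounds the additional imbalance, and drawing ε' uniformly from the interval is an assumption, since the text only states that it is chosen randomly.

```python
import math
import random


def sample_imbalance(rng, eps_target, eps_hat):
    """Draw eps' from [eps_target + 0.005, eps_target + eps_hat].

    With eps_target = 0 this is the interval used for perfectly
    balanced partitions. Uniform sampling is an assumption.
    """
    return rng.uniform(eps_target + 0.005, eps_target + eps_hat)


def max_block_size(num_nodes, k, eps):
    """Largest integer block size not exceeding (1 + eps) * ceil(|V| / k)."""
    return math.floor((1 + eps) * math.ceil(num_nodes / k))


rng = random.Random(42)
eps_prime = sample_imbalance(rng, 0.0, 0.04)
relaxed_bound = max_block_size(1000, 4, eps_prime)  # used by create/combine
final_bound = max_block_size(1000, 4, 0.0)          # enforced afterwards: 250
```

The create or combine operation works with the relaxed bound, and the balancing and negative cycle local search then bring every block down to the final bound.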



D. Miscellanea 

We also tried to integrate the negative cycle detection 
strategies into the multi-level scheme of KaFFPa. However, 
experiments did not indicate large improvements and further- 
more the runtime increased drastically. This is due to the fact 
that the size of the model of the negative cycle detection 
strategies depends heavily on the sum of the weights of 
the nodes moved (the number of layers in the model is 
the maximum of the sum of the weights moved between 
a pair of blocks during construction of the directed local 
searches). Also recall that a multi-level graph partitioning 
algorithm creates a sequence of smaller graphs, e.g. by 
computing matchings and contracting matched edges. This 
kind of compression is not helpful for our model in the 
perfectly balanced refinement scheme. 

V. Experiments 

Implementation: We have implemented the algorithm 
described above using C++. We implemented negative cycle 
detection with subtree disassembly and distance updates 
as described in [9]. Overall, our program (including 
KaFFPa(E)) consists of about 23 000 lines of code. The 
implementation of the perfectly balanced local search al- 
gorithms has about 3 400 lines of code. 
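
For readers unfamiliar with negative cycle detection, the following sketch shows the idea using the plain Bellman-Ford method. It is a simpler technique than the subtree disassembly variant of [9] that our implementation uses, and it is written in Python for brevity rather than in C++. The function name and the edge-list representation are assumptions for illustration.

```python
def find_negative_cycle(num_nodes, edges):
    """Plain Bellman-Ford negative cycle detection.

    edges is a list of (u, v, weight) with nodes numbered 0..num_nodes-1.
    Returns the nodes of one negative cycle in edge order, or None.
    """
    # Zero initial distances act like a virtual source joined to every node.
    dist = [0] * num_nodes
    pred = [None] * num_nodes
    last = None
    for _ in range(num_nodes):
        last = None
        for u, v, w in edges:
            if dist[u] + w < dist[v]:
                dist[v] = dist[u] + w
                pred[v] = u
                last = v
        if last is None:
            return None  # a pass without relaxation: no negative cycle
    # A relaxation in pass number num_nodes implies a negative cycle.
    # Walking back num_nodes predecessor steps ends on the cycle itself.
    node = last
    for _ in range(num_nodes):
        node = pred[node]
    cycle = [node]
    current = pred[node]
    while current != node:
        cycle.append(current)
        current = pred[current]
    cycle.reverse()
    return cycle


edges = [(0, 1, 1), (1, 2, -3), (2, 0, 1), (2, 3, 5)]
print(find_negative_cycle(4, edges))  # [0, 1, 2]
```

In our setting the nodes of the model correspond to blocks (and layers), and a negative cycle corresponds to a set of node movements that decreases the cut.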

System: Experiments have been done on two machines. 
Machine A has four quad-core Opteron 8350 processors (2.0 GHz) 
and 64 GB RAM, and runs Ubuntu 10.04. Machine B is a cluster 
with 200 nodes where each node is equipped with two quad-core 
Intel Xeon processors (X5355) which run at a clock 
speed of 2.667 GHz. Each node has 2×4 MB of level 2 
cache and runs SUSE Linux Enterprise 11 SP 1. All 
nodes are attached to an InfiniBand 4X DDR interconnect 
which is characterized by its very low latency of below 
2 microseconds and a point-to-point bandwidth between 
two nodes of more than 1300 MB/s. All programs were 
compiled using GCC version 4.7 with optimization level 3 
and OpenMPI 1.5.5. 

Parameters: After an extensive evaluation of the parameters 
we fixed the number of packing iterations to μ = 20 
(larger values of μ, e.g. iterating until no boundary node is 
eligible, did not yield further improvements). The maximum 
number of node movements per directed local search is set 
to τ = 15 for k ≤ 8 and to τ = 7 for k > 8 since 
this turned out to be a good tradeoff between quality and 
runtime. The number of unsuccessful iterations until we 
perform a balancing step, λ, is set to three. When using 
KaBaPE to create perfectly balanced or ε-balanced partitions 
we choose random values around the parameters above for 
each create or combine operation. To be more precise, each 
time we perform a create or combine operation we pick 
a random number of node movements per directed local 
search τ ∈ [1, 30], a random number of packing iterations 
μ ∈ [1, 20] and λ ∈ [1, 10] and use these parameters for the 
balancing and negative cycle detection strategies. 
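
The per-operation parameter choice can be sketched as below. The dictionary keys are hypothetical names for the three parameters, and sampling integers uniformly from the stated ranges is an assumption; the text only specifies the ranges.

```python
import random


def sample_parameters(rng):
    """Draw one parameter set for a create or combine operation."""
    return {
        "tau": rng.randint(1, 30),  # node movements per directed local search
        "mu": rng.randint(1, 20),   # packing iterations
        "lam": rng.randint(1, 10),  # unsuccessful iterations before balancing
    }


params = sample_parameters(random.Random(7))
```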
TABLE I 

Relative number of improved instances in the Walshaw 
Benchmark. Configurations: Basic (Most Basic Negative 
Cycle Improvement), +ZeroGain (As Before Plus Zero Weight 
Cycle Diversification), Advanced (Advanced Model, 
Directed Local Searches and Zero Weight Cycle 
Diversification), +Packing (As Before Plus Packing Enabled) 

A. Walshaw Benchmark 

In this section we apply our techniques to all graphs in 
Chris Walshaw's benchmark archive [21]. This archive is a 
collection of real-world instances for the graph partitioning 
problem. The rules used there imply that the running time 
is not an issue, but one wants to achieve minimal cut 
values for k ∈ {2, 4, 8, 16, 32, 64} and balance parameters 
ε ∈ {0, 0.01, 0.03, 0.05}. It is the most used graph partitioning 
benchmark in the literature. Most of the graphs of the 
benchmark come from finite-element applications; however, 
there are also some graphs from VLSI design and a road 
network. In KaFFPa and KaFFPaE we focused on partitions 
of graphs where some imbalance, i.e. ε ∈ {0.01, 0.03, 0.05}, 
is allowed since the techniques used therein are not made 
for the case ε = 0. 

Improving Existing Partitions: When we started to look 
at perfectly balanced partitioning we counted the number of 
perfectly balanced partitions in the benchmark archive that 
contain nodes having positive gain, i.e. nodes that could 
reduce the cut when being moved to a different block. 
Astonishingly, we found that 55% of the perfectly balanced 
partitions in the archive contain nodes with positive gain 
(some of them have up to 1400 of such nodes). These nodes 
usually cannot be moved by simple local search due to 
the balance constraint. Therefore, we now take the existing 
perfectly balanced partitions in the benchmark archive and 
use them as input to our local search algorithms KaBaR. 
This experiment has been performed on machine A and for 
all configurations of the algorithm we used λ = 20 for 
the number of unsuccessful tries. Table I shows the relative 
number of partitions that have been improved by different 
algorithm configurations and k (in total there are 34 graphs 
per number of blocks k). 
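
Counting nodes with positive gain, as done for the archive partitions above, can be sketched as follows. The function name and the adjacency-dictionary representation of an unweighted graph are assumptions for illustration. A node has positive gain if it has more neighbours in some other block than in its own block.

```python
from collections import Counter


def positive_gain_nodes(adj, part):
    """Return the nodes that would reduce the cut if moved to another block."""
    result = []
    for node, neighbours in adj.items():
        connectivity = Counter(part[v] for v in neighbours)
        internal = connectivity.pop(part[node], 0)
        # Best gain = (edges to the best other block) - (edges to own block).
        if connectivity and max(connectivity.values()) > internal:
            result.append(node)
    return result


# Two blocks of two nodes each; the partition is perfectly balanced.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
part = {0: 0, 1: 0, 2: 1, 3: 1}
print(positive_gain_nodes(adj, part))  # [2]
```

Node 2 has gain one, but moving it alone would produce blocks of size three and one, which is why simple local search cannot use such a move under the perfect balance constraint.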

It is somewhat surprising that already the most basic 
variant of the algorithm, i.e. negative cycle detection with- 
out the zero weight cycle diversification mechanism, can 
improve 47% of the existing entries. For all configurations 
except the two advanced ones, the number of improved 
partitions increases with the number of blocks k. 
Less surprising but still interesting is that the more 
advanced the local searches and models become, the 
more partitions can be improved. Note that it took roughly 
two hours to compute 63% of the entries, i.e. 128 partitions, 
having a smaller cut than reported in the archive using one core 
of machine A when applying the advanced algorithm with 
packing enabled (the most expensive configuration of the 
algorithm). This is very affordable considering the fact that 
some of the previous approaches, such as Soper et al. 
[21], have taken many days to compute one entry of the 
benchmark tables. Of course in practice we want to find high 
quality partitions without using input partitions generated 
by other algorithms. We therefore compute partitions from 
scratch in the next section. 

Computing Partitions from Scratch: We now compute 
perfectly balanced partitions from scratch, i.e. we do not 
take existing partitions as input to our algorithm. To do 
so we use machine B and run KaBaPE with a time limit 
t_k = 225 · k seconds using 32 cores (four nodes of the 
cluster) per graph and k > 2. On the eight largest graphs 
of the archive we gave KaBaPE a time limit of 4 · t_k 
seconds per graph and k > 2. For k = 2 we gave KaBaPE 
one hour of time and 32 cores. The imbalance bound ε̂ was 
set to 4% for the small graphs and to 3% for the eight 
largest graphs in the archive. We summarize the results in 
Table II and report the 
complete list of results obtained in the Appendix. Currently 
we are able to improve or reproduce 86% of the entries 
reported in this benchmark. In the bipartition case we mostly 
reproduce the entries reported in the benchmark (instead of 
improving). This is not surprising since the models presented 
in this paper can contain only trivial cycles of length two 
in this case and our previous algorithms have shown the 
same behaviour for larger imbalance values ([13], [16], 
[18], [19]). Also, it has recently been shown by Delling 
et al. [10] that some of the perfectly balanced bipartitions 
reported there are optimal. We also applied our algorithm 
for larger imbalances, i.e. 1%, 3% and 5%, in the Walshaw 
Benchmark. For the case ε = 1% we ran our algorithm 
KaBaPE on all instances using the same parameters ε̂ and 
t_k as in the perfectly balanced case. Here we are able to 
improve or reproduce the cut in 160 out of 204 cases. A 
table reporting detailed results can be found in the Appendix. 
Afterwards we performed additional partitioning trials on 
all instances where our systems (including [13], [16], [18], 
[19]) have not yet been able to reproduce or improve 
the entry reported there, using different parameters and 
different machines. Doing so our systems now improved or 
reproduced 98%, 99%, 99%, 99% of the entries reported 
there for the cases ε = 0, 1%, 3%, 5% respectively. These 
numbers include the entries where we used the current record 
as an input to our algorithms. They contribute roughly 4%, 
7%, 11%, 9% for the cases ε = 0, 1%, 3%, 5% respectively. 



B. Costs for Perfect Balance 

It is hard to perform a meaningful comparison to other 
partitioners since publicly available tools such as Scotch 
[17], Jostle [23] and Metis [14] are either not able to 
take the desired balance as an input parameter or are not 
able to guarantee perfect balance. This is a major problem 
for the comparison with these tools since allowing larger 
imbalances, e.g. ε = 3%, decreases the number of edges 
cut significantly [20]. Hence, we have a look at the number 
of edges cut by our algorithm when perfect balance is 
enforced, i.e. the increase in the number of edges cut when 
we seek a perfectly balanced partition. To do so we use 
machine B and KaFFPaStrong to create partitions having an 
imbalance of ε = 1% and then create perfectly balanced 
partitions using our advanced negative cycle model and 
advanced balancing. KaFFPaStrong is designed to achieve 
very good partition quality and is the strongest configuration 
of our multi-level graph partitioner KaFFPa. For each 
instance (graph, k) we repeat the experiment ten times using 
different random seeds. We then compare the final cuts of 
the perfectly balanced partitions to the number of edges cut 
before the balancing and negative cycle search started, i.e. 
when ε = 1% imbalance is allowed. The instances used 
for this experiment are the same as in KaFFPa [18] and are 
available for download at the 10th DIMACS Implementation 
Challenge [1], [2]. The main properties of these graphs 
are summarized in the Appendix. Table III summarizes the 
results of the experiment; detailed results are reported in the 
Appendix. On average the number of edges cut increased by 
roughly 6% when enforcing perfect balance and the runtime 
of the negative cycle local search and balancing strategies is 
comparable with the average runtime of KaFFPaStrong. 
TABLE II 

Number of improvements computed from scratch, ε = 0. 



TABLE III 

Costs for Perfect Balance, Rel. to KaFFPaStrong when 
ε = 1% imbalance is allowed. Rel. Cut reports the average 
increase in the cut after the 1% partitions have been 
balanced and Rel. time reports the average time used by 
KaBaR relative to the runtime of KaFFPaStrong. 



VI. Conclusion and Future Work 

In this paper we have presented novel algorithms to 
tackle the perfectly balanced graph partitioning problem. 
These algorithms are able to combine local searches by 
a model in which a cycle corresponds to a set of node 
movements in the original partitioned graph that roughly 
speaking do not alter the balance of the partition. Here 
we demonstrated that a negative cycle detection algorithm 
is very well suited to find cycles in our model that are 
associated with node movements decreasing the overall cut. 
Experiments indicate that previous algorithms have not been 
able to find such rather complex movements. An integration 
into our parallel multi-level evolutionary algorithm is able 
to improve or reproduce most of the entries reported in 
the Walshaw Benchmark in a reasonable amount of time. 
Additionally the algorithm is also useful if some imbalance 
is allowed. In contrast to previous algorithms such as Scotch 
[17], Jostle [23] and Metis [14], our algorithms are able to 
guarantee that the output partition is feasible. 

An open question is whether it is possible to define a 
conflict-free model that encodes the same kind of node 
movements as our advanced model. In future work, it could 
be interesting to see if one can integrate other types of local 
search from KaFFPa [18] into our models. The packing 
algorithm can be improved such that instead of simply 
picking the best directed local search moving d nodes, it 
could find the best combination of the computed directed 
local searches to move d nodes. It will also be interesting to 
see whether our techniques are applicable to other problems 
where local search is restricted by constraints. For example 
this kind of local search could be very interesting in the 
area of multi-constraint graph partitioning or in the area of 
hypergraph partitioning. 
Instances used in the cost for perfect balance experiment. 
TABLE V 

Detailed results of the cost for perfect balance experiment. 
TABLE VI 

Computing partitions from scratch, ε = 0%. In each k-column the results computed by KaBaPE are on the left and the current Walshaw cuts are presented on the 
right side. Entries marked with a * can be improved by our refinement techniques when using the current record as input. Entries marked with a + have been 
obtained using different parameters and seeds. @ indicates that we hold this record. 
TABLE VII 

Computing partitions from scratch, ε = 1%. In each k-column the results computed by KaBaPE are on the left and the current Walshaw cuts are presented on the 
right side. Entries marked with a * can be improved by our refinement techniques when using the current record as input. Entries marked with a + have been 
obtained using different parameters and seeds. @ indicates that we hold this record. 



